In [3]:
import IPython.display as disp
import pandas as pd
import numpy as np
import ggplot as gp
import scipy.stats as st
import statsmodels.api as sm
import math
import geopy.distance as gdist
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
Mann-Whitney U
Two-tailed vs One-tailed
The Mann-Whitney U test was used to analyze the NYC subway data. A two-tailed p-value was used because of the uncertainty over whether rain increases or decreases the number of people using the subway. The null hypothesis comparing the two populations, rainy days and non-rainy days, is as follows:
$H_0$: The distribution of hourly entries is the same for the two populations.
$H_A$: The distribution of hourly entries is different for the two populations.
Justification for this hypothesis is given in the next question. A two-tailed p-critical value of 0.05 will be used.
In order to choose an appropriate statistical test for the dataset, the shape of the distributions first needs to be examined graphically:
In [8]:
df = pd.read_csv("./turnstile_weather_v2.csv")
df_rain = df[df['rain'] == 1]
df_no_rain = df[df['rain'] == 0]
df_rain_entries = df_rain.ENTRIESn_hourly
df_no_rain_entries = df_no_rain.ENTRIESn_hourly
gp.ggplot(gp.aes(x="ENTRIESn_hourly", fill='rain',color='rain'), data=df) +\
gp.geom_histogram(alpha=0.6, binwidth=1000)
Out[8]:
Looking at the histogram, the datasets do not appear to follow a normal distribution. We can verify this statistically with the Shapiro-Wilk test.
In [439]:
print(st.shapiro(df_rain_entries))
print(st.shapiro(df_no_rain_entries))
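To illustrate how the Shapiro-Wilk output should be read, here is a small sketch on synthetic data (not the turnstile data): a genuinely normal sample yields a large p-value, while a skewed sample yields a tiny one, leading us to reject normality.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(2)
normal_sample = rng.normal(size=500)       # drawn from a normal distribution
skewed_sample = rng.exponential(size=500)  # heavily right-skewed, like the entry counts

# Shapiro-Wilk: a small p-value means we reject the hypothesis of normality
stat_n, p_normal = st.shapiro(normal_sample)
stat_s, p_skewed = st.shapiro(skewed_sample)
print(p_normal, p_skewed)
```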
We can safely assume that the populations are not normal after analyzing them both visually and statistically.
In 1.1, the hypotheses compare the distributions of the two populations because we are assuming the distributions are similarly shaped, as shown in the histogram above.
In [440]:
with_rain_mean = round(df_rain_entries.mean(), 2)
without_rain_mean = round(df_no_rain_entries.mean(), 2)
# This version of scipy's mannwhitneyu returns a one-sided p-value,
# so it is doubled here to obtain the two-tailed p-value.
U, p = st.mannwhitneyu(df_rain_entries, df_no_rain_entries)
p = round(p * 2, 7)
Mean Hourly Entries while raining: {{with_rain_mean}}
Mean Hourly Entries while not raining: {{without_rain_mean}}
U Statistic: {{U}}
p-value: {{p}}
Since our p-value of {{p}} is less than 0.05, we can reject the null hypothesis that the two distributions are identical. We can conclude that rain does in fact play a role in hourly entries on the NYC subway. Since the mean hourly entries while raining is greater than the mean hourly entries while not raining, we can also say that rain does in fact increase NYC subway ridership.
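A p-value alone says nothing about the size of the effect. One common follow-up for the Mann-Whitney U test is the rank-biserial correlation, $r = 1 - 2U/(n_1 n_2)$. The sketch below uses synthetic stand-in samples (not the real turnstile data) and the modern scipy API, where the two-sided p-value is requested directly instead of doubling:

```python
import numpy as np
import scipy.stats as st

# Synthetic stand-ins for the rainy / non-rainy entry counts (illustrative only)
rng = np.random.default_rng(0)
rain = rng.exponential(scale=2000, size=500)
no_rain = rng.exponential(scale=1800, size=1500)

# Modern scipy lets us ask for the two-sided p-value explicitly
U, p = st.mannwhitneyu(rain, no_rain, alternative='two-sided')

# Rank-biserial correlation: 0 means no tendency for one group's values
# to exceed the other's; magnitude near 1 means a strong tendency.
# (Sign conventions vary across references.)
r = 1 - 2 * U / (len(rain) * len(no_rain))
print(U, p, r)
```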
In [441]:
def linear_regression(features, values):
    features = sm.add_constant(features)
    model = sm.OLS(values, features)
    return model.fit()

def predictions(dataframe):
    features = dataframe[['tempi', 'precipi', 'meanprecipi', 'pressurei',
                          'meanpressurei', 'wspdi', 'meanwspdi']]
    # Dummy variables for the categorical columns
    dummy_units = pd.get_dummies(dataframe['UNIT'], prefix='unit')
    features = features.join(dummy_units)
    dummy_units = pd.get_dummies(dataframe['conds'], prefix='cond')
    features = features.join(dummy_units)
    dummy_units = pd.get_dummies(dataframe['hour'], prefix='hour')
    features = features.join(dummy_units)
    dummy_units = pd.get_dummies(dataframe['day_week'], prefix='day')
    features = features.join(dummy_units)
    # Values
    values = dataframe['ENTRIESn_hourly']
    # Perform linear regression
    results = linear_regression(features, values)
    intercept, params = results.params[0], results.params[1:]
    predictions = intercept + np.dot(features, params)
    return results, predictions
In [442]:
results, predictions = predictions(df)
Input Variables:
Dummy Variables:
Features were initially selected solely by intuition. For features with categorical values, dummy variables were used.
By investigating the values in the weather conditions column, I found that the fog and rain indicators were not necessary, since the weather conditions column already captures them in one of its categories.
The hour feature was later converted into a categorical input, since I suspected that a single coefficient could not capture its effect on the fluctuations in the number of hourly entries.
The same suspicion led to adding the day_week feature as a dummy variable too, rendering the weekday indicator obsolete.
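The dummy-variable expansion described above can be seen on a toy column (the values here are illustrative, not from the dataset): `pd.get_dummies` produces one indicator column per distinct value, so each level of `hour` gets its own coefficient rather than a single slope.

```python
import pandas as pd

# Toy frame with a categorical 'hour' column (values are illustrative)
toy = pd.DataFrame({'hour': [0, 4, 12, 20, 20]})
dummies = pd.get_dummies(toy['hour'], prefix='hour')

# One indicator column per distinct hour value; each row has exactly one 1
print(dummies.columns.tolist())
```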
In [443]:
results.params["tempi":"meanwspdi"]
Out[443]:
Condition, hour, and day dummy variables:
In [444]:
results.params["cond_Clear":"day_6"]
Out[444]:
In [445]:
r_squared = round(results.rsquared,4)
r_squared_pct = r_squared * 100
r_squared
Out[445]:
Our $r^2$ value tells us that {{r_squared_pct}}% of the variation in hourly entries for the NYC subway is explained by the features provided. It indicates that there is a relationship between the features listed above and NYC ridership.
In [446]:
def plot_residuals(turnstile_weather, predictions):
    turnstile_weather["Hourly Entries Residuals"] = turnstile_weather["ENTRIESn_hourly"] - predictions
    # Crop extreme residuals so the central shape of the histogram is visible
    cropped = turnstile_weather[(turnstile_weather["Hourly Entries Residuals"] > -5000) &
                                (turnstile_weather["Hourly Entries Residuals"] < 5000)]
    plt = gp.ggplot(gp.aes(x="Hourly Entries Residuals"), data=cropped) + gp.geom_histogram(binwidth=100)
    return plt

print(plot_residuals(df, predictions))
entries_mean = round(df['ENTRIESn_hourly'].mean(), 2)
We can see what the $r^2$ value represents by looking at the residuals. The majority of the residuals lie between -2000 and 2000, which is not good considering that the mean hourly entries figure is {{entries_mean}}. Therefore I think this linear model is not sufficient to predict subway ridership.
In [447]:
gp.ggplot(df, gp.aes(x="ENTRIESn_hourly", fill="rain", color="rain")) + \
gp.geom_histogram() + \
gp.ggtitle('Hourly Entries on Rainy vs. Non-Rainy Days')
Out[447]:
In [448]:
grouped = df[["hour","ENTRIESn_hourly"]].groupby("hour")
ridership_hour = grouped.aggregate(np.mean)
ridership_hour = ridership_hour.reset_index()
gp.ggplot(ridership_hour, gp.aes('hour', weight="ENTRIESn_hourly")) + \
gp.geom_bar(binwidth=4) + \
gp.scale_y_continuous(labels='comma') + \
gp.scale_x_continuous(limits=(0,20)) + \
gp.ggtitle('Average Ridership by Hour of the Day')
Out[448]:
In [449]:
def convertDays(x):
    # Map the numeric day_week codes to readable labels
    days = {0: "Mon", 1: "Tues", 2: "Wed", 3: "Thurs",
            4: "Fri", 5: "Sat", 6: "Sun"}
    return days.get(x)
grouped = df[["day_week","ENTRIESn_hourly"]].groupby("day_week")
ridership_day = grouped.aggregate(np.mean)
ridership_day = ridership_day.reset_index()
ridership_day["day_week"] = ridership_day["day_week"].map(convertDays)
gp.ggplot(ridership_day, gp.aes(x="day_week", y="ENTRIESn_hourly")) +\
gp.geom_bar(stat='identity') +\
gp.ggtitle('Average Ridership by Day of Week')
Out[449]:
The Mann-Whitney U test performed on the two datasets, ridership with rain and ridership without rain, showed that they do not come from the same distribution. Together with the higher mean ridership while it was raining, we can conclude that people ride the NYC subway more when it rains.
In addition to the Mann-Whitney U test, a linear regression was performed to see what other factors play a role in NYC ridership. With the coefficients calculated, we can compare those related to non-rainy and rainy days. As suspected, the coefficient for the cond_Rain feature was much larger than that of the cond_Clear feature.
By making the 'conds' column a dummy variable, we gain further insight into how more descriptive weather conditions affect subway ridership. Judging by their coefficients, fog and mist appear to play an even bigger role than rain in subway ridership.
In [ ]:
def css_styling():
    styles = open("../css/custom.css", "r").read()
    return disp.HTML(styles)

css_styling()